fix: DeepSeek-OCR server crash, template routing, and CUDA OOM fallback by soymh · Pull Request #23394 · ggml-org/llama.cpp

soymh · 2026-05-20T08:18:53Z

Upstream PR: #17400
Original implementation by sfallah (sf/deepseek-ocr branch): https://github.com/sfallah/llama.cpp

Co-authored-by: sfallah <sfallah@users.noreply.github.com>

Fixes:

- server crash: GGML_ASSERT(batch.n_tokens > 0) when mtmd image processing consumes all prompt tokens (inject synthetic token before assertion)
- server crash: slot.task NULL dereference after release() on mtmd OOM
- server crash: ggml_backend_sched_alloc_graph segfault when CUDA OOM (check return value, matching sfallah's upstream guard pattern)
- template routing: --chat-template deepseek-ocr was rendered as Jinja text instead of resolving to the built-in template (auto-detect + legacy fallback)
- bitmap/marker mismatch: get_media_marker() returned random string instead of mtmd_default_marker(), causing tokenizer to split on wrong marker
- CUDA OOM: tensor loading falls back to CPU backend when GPU allocation fails
- mmproj GPU: -ngl 0 now also disables mmproj GPU (matches user intent)
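
The two OOM items above share one idea: treat a failed allocation as a return value to check, then retry on a slower backend instead of crashing. The following is a minimal sketch of that idea only. It is not the code from this PR and does not use the ggml API; `Backend`, `try_alloc`, and `alloc_with_fallback` are hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a compute backend; not a llama.cpp or ggml type.
struct Backend {
    std::string name;
    std::size_t free_bytes;  // memory still available on this backend
};

// Tries to reserve `size` bytes on one backend. Reports failure through the
// return value instead of aborting, so the caller can decide how to recover.
bool try_alloc(Backend & backend, std::size_t size) {
    if (size > backend.free_bytes) {
        return false;
    }
    backend.free_bytes -= size;
    return true;
}

// Walks the backends in priority order (for example GPU first, CPU last) and
// returns the name of the first one that accepts the allocation, or an empty
// string if none of them can.
std::string alloc_with_fallback(std::vector<Backend> & backends, std::size_t size) {
    for (Backend & backend : backends) {
        if (try_alloc(backend, size)) {
            return backend.name;
        }
    }
    return "";
}
```

With a backend list of `{"gpu", 100}` and `{"cpu", 1000}`, a first request for 80 bytes lands on `gpu`, a second request for 80 bytes no longer fits there and lands on `cpu`, and a request larger than every backend returns an empty string that the caller has to handle explicitly.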

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

ngxson · 2026-05-20T08:26:44Z

These changes are too invasive and break all other models, so we cannot accept them.

ngxson · 2026-05-20T08:27:53Z

ref: #23345

soymh requested review from a team as code owners May 20, 2026 08:18

ngxson closed this May 20, 2026

github-actions bot added the examples and server labels May 20, 2026

soymh wants to merge 1 commit into ggml-org:master from soymh:feat/deepseek-ocr
